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Interaction networks are of central importance 
in post-genomic molecular biology, with increas- 
ing amounts of data becoming available by high- 
throughput methods. Examples are gene regula- 
tory networks or protein interaction maps. The 
main challenge in the analysis of these data is 
to read off biological functions from the topol- 
ogy of the network. Topological motifs, i.e., pat- 
terns occurring repeatedly at different positions in 
the network have recently been identified as basic 
modules of molecular information processing. In 
this paper, we discuss motifs derived from fami- 
lies of mutually similar but not necessarily identi- 
cal patterns. We establish a statistical model for 
the occurrence of such motifs, from which we de- 
rive a scoring function for their statistical signifi- 
cance. Based on this scoring function, we develop 
a search algorithm for topological motifs called 
graph alignment, a procedure with some analogies 
to sequence alignment. The algorithm is applied 
to the gene regulation network of E. coli. 

The vast amount of sequence data collected over 
the past two decades is at the heart of quantita- 
tive molecular biology. Biological information is ex- 
tracted from these data mainly by analyzing similar- 
ities between sequences. This approach is based on 
efficient sequence alignment algorithms and a statisti- 
cal theory to assess the significance of the results (see 
m). Its ultimate goal is to infer functional relation- 
ships from correlations between sequences. Over the 
last few years, however, it has become clear that func- 
tions in many cases cannot be identified at the level of 
single genes. A given function may require the coop- 
erative action of several genes, and conversely, a given 
gene may play a role in quite different functional con- 
texts. The genome is thus a highly interactive system 
and the expression of a gene depends on the activity 
of other genes. The pathways of these interactions 



are encoded in so-called regulatory networks. Sim- 
ilarly complex networks govern signal transduction, 
that is, the infiuence of external signals on gene ex- 
pression, or protein interactions, that is, the ability 
of two or more proteins to form a bound state in a 
living cell. 

A few exemplary cases of gene networks have been 
studied in much detail, such as the regulation of early 
development in the sea urchin Strongylocentrotus pur- 
puratus 2 or in Drosophila In some approxima- 
tion, these structures can be understood as logical 
networks: the expression level of a gene is reduced to 
a binary variable ( on or ojf) and is specified in terms 
of binary input data, i.e., the expression levels of its 
"upstream" genes. 

On the other hand, a large amount of data on 
molecular interaction networks is now obtained by 
high-throughput experiments, for example protein in- 
teraction maps in yeast [3j or gene expression ar- 
rays jSj. In these arrays, one probes the activity of 
an entire genome, rather than of just a few genes. 
However, the detailed logical connection of interac- 
tion pathways is typically lost. The information is re- 
duced to a topological network, that is, a set of nodes 
(representing, e.g., genes or proteins) and links rep- 
resenting their pairwise interactions. These links can 
be directed as in the case of regulatory interactions 
or undirected as for protein-protein binding. The 
amount of topological data on molecular networks 
is expected to increase rapidly in the next few years, 
paralleling the earlier explosion of sequence data. 

What can be learned from these data? Using the 
network topology alone, can we distinguish patterns 
of biological function from random background? The 
purpose of this paper is to develop a "bioinformat- 
ics" approach to the search for local modules in net- 
works. We discuss a heuristic search algorithm and 
its statistical grounding in a stochastic model of net- 
work evolution. This approach is designed to comple- 
ment experiments in specific organisms by large-scale 
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database searches. 

In two seminal studies jS] Ej , it has recently been 
shown that topological networks indeed contain sta- 
tistically significant patterns indicative of biological 
functions. These motifs are patterns which occur 
more frequently in the observed network than ex- 
pected in a suitable null ensemble. The motifs found 
so far have been identified because they occur iden- 
tically at different positions in a network. 

If network evolution is a stochastic process, how- 
ever, functionally related motifs do not need to be 
topologically identical. Hence, the notion of a motif 
has to be generalized to a stochastic one as well. Vari- 
ations arise due to uncertainties in the network data, 
or - more importantly - because some of the interac- 
tions can change without affecting the functionality 
of the motif. This "noise" is an important charac- 
teristic of biological systems, familiar from sequence 
analysis where one searches for local sequence similar- 
ities blurred by mutations and insertions/deletions, 
rather than for identical subsequences. It leads us 
to the notion of a probabilistic motif where each link 
occurs with a certain likelihood. Probabilistic mo- 
tifs arise as consensus from finding a family of "suffi- 
ciently" similar subgraphs in a network. The search 
for mutually similar subgraphs and their probabilistic 
motifs is the central issue of this paper. 

The motifs of interest here are non-random in two 
ways: they have an enhanced number of internal 
links, associated e.g. with feedback, and they ap- 
pear in a significant number of subgraphs. Identifying 
these local deviations from randomness in networks 
requires a statistical theory of local graph structure, 
which we establish in this paper. This is a comple- 
mentary approach to the global statistics measured 
by the connectivity distribution 'B' or connectivity 
correlations E3 of a network. 

Our approach leads to an algorithmic procedure 
termed local graph alignment, which is conceptually 
similar to sequence alignment. It is based on a scor- 
ing function measuring the statistical significance for 
families of mutually similar subgraphs. This scoring 
involves quantifying the significance of the individ- 
ual subgraphs as well as their mutual similarity, and 
is thus considerably more complicated than for fam- 
ilies of identical motifs. Our scoring function is de- 
rived from a stochastic model for network evolution. 
There is indeed evidence that network evolution can 
be described as a stochastic process. For example, the 
comparison of the regulatory networks for early devel- 
opment in several Drosophila species has revealed the 
continuous buildup and loss of gene interactions fol- 
lowing an approximate molecular clock ^T] . Yet little 



is known about the specific pathways of network evo- 
lution. Our scoring function is compatible with di- 
vergent evolution of subgraphs but also with conver- 
gent evolution towards a common functional motif. 
These pathways can be illustrated by a comparison 
with sequence evolution. An example of convergent 
evolution is the formation of sequence motifs serv- 
ing as binding sites of specific enzymes |12[ [T^ . An 
example of divergent evolution is a set of sequences 
stemming from a common ancestor undergoing mu- 
tations independently. The probabilistic grounding 
of graph alignment allows us to infer optimal scoring 
parameters by a maximum- likelihood procedure (I4j . 

As a computational problem, graph alignment 
is more challenging than sequence alignment. Se- 
quences can be aligned in polynomial time using dy- 
namic programming algorithms. For graph align- 
ment, a polynomial-time algorithm probably does 
not exist. Already simpler graph matching problems 
such as the subgraph isomorphism problem (decid- 
ing whether a graph contains a given subgraph) jl5l 
116) or finding the largest common subgraph of two 
graphs are iVP-complete and A^P-hard, respec- 
tively. Thus, an important issue for graph alignment 
is the construction of efficient heuristic search algo- 
rithms. Here we solve this problem by mapping graph 
alignment onto a spin model familiar in statistical 
physics, which can be treated by simulated anneal- 
ing. 

This paper is structured as follows. In the first 
part, we discuss the statistics of local subgraphs 
based on a probabilistic model. This is done in three 
steps: (i) an individual subgraph with an enhanced 
number of internal links, (ii) a subgraph in the pres- 
ence of a template motif specifying the functional im- 
portance of each link, and (iii) correlated subgraphs, 
whose common pattern is to be inferred from the data 
instead of being given as a template. We then con- 
struct a scoring function designed to distinguish sets 
of statistically significant network motifs with an en- 
hanced number of links from a background of other 
patterns. High-scoring motifs are found by an align- 
ment algorithm, details of which are described in the 
supporting text. In the second part of the paper, 
we apply this method to the regulatory network of 
Escherichia coli and discuss the probabilistic motifs 
found. The statistics of these motifs is used to test 
the assumptions of our probabilistic model. 

Graphs and patterns 

A topological network or graph is a set of nodes and 
links. Labeling the nodes by an index r = 1, . . . , N, 
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the network is described by the adjacency matrix C, 
which has entries Crr' = 1 if there is a directed 
hnk from node r to node r' and Crr' = otherwise. 
Graphs with a generic adjacency matrix are caUed di- 
rected. The special case of a symmetric adjacency ma- 
trix can be used to describe undirected graphs. The 
in and out connectivities of a node, fc+ = J^r' ^r'r 
and k~ = J^r' ^^rr', are defined as the number of in- 
and outgoing hnks, respectively. The total number 
of hnks is denoted hy K = ^, Crr'- The networks 
considered here are spars e^ i.e., their average connec- 
tivity K/N is of order 1. 

A subgraph G is given by a subset of n vertices 
{ri, . . . r„} and the resulting restriction of the adja- 
cency matrix. More precisely, we define the matrix 
c(G, A) with the entries = Cnrj = 1, • ■ • , n) 
specifying the internal links of the subgraph for a 
given order A of the nodes. This matrix c is called 
a pattern^ which is contained in the subgraph. The 
definition of a pattern used here implies that two pat- 
terns are counted as separate if the matrices c and 
c' are different. This assumes that nodes are dis- 
tinguishable by their biochemical identity and their 
functional role even if they are at symmetric posi- 
tions, i.e., if c and c' differ only by the labeling of 
the nodes. An alternative definition would count two 
matrices c and c' related by a relabeling as defin- 
ing an identical pattern. Which definition is more 
appropriate depends on the particular biological ap- 
plication. 

The most important characteristic of patterns for 
what follows is their number of internal links, 

L(c)-^c,,. (1) 

Fig. n shows two subgraphs that differ in the values 
of L. 

Graph alignments and motifs 

A graph alignment is defined by a set of several sub- 
graphs G" (a = 1, . . . ,p) and a specific order of the 
nodes {rf , . . . , r^} in each subgraph; this joint order 
is again denoted by A. For simplicity, we assume here 
that the subgraphs are of the same size n, but it is 
not difficult to generalize our approach in order to in- 
clude subgraphs of different size. For a given set of p 
mutually disjoint subgraphs, there are (n!)^ different 
alignments. An alignment associates each node in a 
subgraph with exactly one node in each of the other 
subgraphs. This can be visualized by n "strings" , 
each connecting the p nodes with the same index i as 
shown in fig. ^ (c) . 
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Figure 1: Motifs and alignment in topological net- 
works, (a) A randomly chosen connected subgraph is 
likely to be a tree, i.e., it has a the number of internal 
links links equal to its number of nodes minus 1. (b) Puta- 
tively functional subgraphs are distinguished by internal 
loops, i.e., by a higher number of internal links, (c) An 
alignment of three subgraphs with four nodes each. Each 
nodes carries an index a = 1,2,3 labeling its subgraph 
and an index i = 1,2,3,4 given by the order of nodes 
within the subgraph. Nodes with the same index i are 
joined by dashed lines, defining a one-to-one mapping 
between any two subgraphs. Network links are shown 
as solid lines (with their arrows suppressed for clarity), 
(d) The consensus pattern of this alignment. Each link 
occurs with a likelihood Cij indicated by the gray scale. 



A given alignment A specifies a pattern in each 
subgraph, we write c" = c(G",^). The consensus 
pattern of this alignment is given by the matrix 



This is a probabilistic pattern, the entry Cy denot- 
ing the likelihood that a given link is present in the 
aligned subgraphs. For any two aligned subgraphs 
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and ^ we can define the pairwise mismatch 

n 

M(c",c^') = J2 - 4) + (1 - ■ (3) 

The mismatch is if and only if the matrices c" and 
are equal, and is positive otherwise. It can be con- 
sidered as a Hamming distance for aligned subgraphs. 
The average mismatch over all pairs of aligned sub- 
graphs, M = M(c, c), is termed the fuzziness of the 
consensus pattern c. Analogously, the average num- 
ber of internal links is denoted hy L = L{c). 

We now define network motifs as statistically 
significant consensus patterns of graph alignments, 
which are distinguished by a high number of inter- 
nal links and low fuzziness. Clearly, this definition is 
mathematically loose before we quantify the statisti- 
cal significance. This will be done in the next three 
sections. 

Guided by the results of refs. [HI E], we take an 
enhanced number of internal links as a topological 
indicator of possible functional modules in networks. 
The additional links beyond a treelike topology can 
be associated with feedback or feed-forward loops in 
transcription networks, or clusters in protein inter- 
action networks. For example, the triangle shown in 
fig. n (b) can be interpreted jB] as a low-frequency 
bandpass filter: the central node is activated if both 
top nodes are active. However, the right hand node 
is activated by that on the left with a small delay, so 
the central node is activated provided the left node 
is active for a time longer than this delay. The non- 
treelike nature of this motif is crucial for its function. 
On the other hand, most randomly chosen connected 
subgraphs would be treelike. Clearly, an enhanced 
number of internal links is but the simplest topolog- 
ical indicator of putative functionality, and more de- 
tailed ways of identifying network motifs are likely to 
emerge in the future. 

Statistics of individual subgraphs 

In order to quantify the statistical significance of a 
given number of internal links, we first compute the 
relevant probability distribution in a suitable random 
graph ensemble, which is generated by an unbiased 
sum over all graphs with the same number of nodes 
and the same connectivities A:+ {r — 1, . . . , N) as 
in the data set but randomly chosen links ^1 EI] • 
This null ensemble is appropriate for biological net- 
works, whose connectivity distribution generally dif- 
fers markedly from that of a random graph with 
uniformly distributed links. In the null ensemble. 



the probability of finding a directed link from node 
ri to node rj is in good approximation given by 
Wij = k~.k^./K jni- Hence, a given subset of nodes 
{ri, . . . , r„} forms a subgraph G with probability 

n 

P,{G)^\{(l-w,i)'-^'^w^. (4) 

This expression neglects double links, which can be 
included as in JU]. The probability Pq{G) depends 
on the pattern c{G) of the subgraph, as well as on 
its environment given by the connectivities k'^ , k~ 
(i = 1, . . . ,n). In this ensemble, the expected num- 
ber of internal links per node is small, {L)a/n ~ n/N ^ 
where we denote the average over a given ensemble 
by (). Hence, most random subgraphs in a large and 
sparse graph are disconnected. Within the subset of 
connected subgraphs, most are treelike. (Later we 
will be interested in the subset of non-treelike sub- 
graphs, and this will require a modification of the 
null ensemble.) 

We now assume that subgraphs containing network 
motifs are generated by a different ensemble P„{G). 
The probability that a given pair of nodes carries a 
link is enhanced by a factor e"' relative to the null 
ensemble 0, leading to 

P,{G)/Pa{G) = Z-^ cxp[aL(c)] . (5) 

Again the probability Pa{G) that a given subset of 
nodes {ri, . . . , r„} forms a subgraph G depends on 
the matrix c(G). We have introduced the normal- 
ization factor Zo- = riij Z]cij=o,i "5^P['^-^(^)] ^o(G), 
which ensures that Pa-{G) summed over all matrices 
c gives unity. The quantity cr, called the link re- 
ward, is multiplied by the total number L of internal 
links given by (Q). The ensemble (0) is a statisti- 
cally unbiased way to describe that functional motifs 
are distinguished by a large number of internal links. 
(Technically, it is the ensemble of maximal informa- 
tion entropy with a given average link number (L), 
which is determined by the value of a.) This ensemble 
may be thought of as resulting from an evolutionary 
process favoring the formation of links due to selec- 
tion pressure; such a process has recently been stud- 
ied for regulatory networks |^. Here we focus on 
the detection of evolved motifs rather than on the re- 
construction of evolutionary histories. Hence, we do 
not need to make assumptions on dynamical details 
of motif formation but only on its outcome, which is 
described by the ensemble Pa{G). We have tested 
the form of this ensemble for the regulatory network 
of E. coli as discussed in the results section below. 
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Moreover, the value of the hnk reward cr can be in- 
ferred from the data. One finds e"' ^ N/n, which 
results in a finite expected number (L) /n of internal 
links per node within a motif. 



Statistics in the presence of a template 

The distribution |(SJ) describes an ensemble with an 
enhanced number of links, which is appropriate for 
scoring individual subgraphs in the absence of further 
knowledge. Consider now an evolutionary process di- 
rected towards a given network motif represented by 
a template adjacency matrix t. An alignment A be- 
tween the motif t and the subgraph G is specified by 
a given ordering of the nodes {ri, . . . , r„} in G. The 
outcome of this evolutionary process can be modeled 
by an ensemble Qt{G,A) with a bias against links 
that do not occur in the template, 



QtiG,A)/PoiG) = Z^'exp 



ai(c)-^M(c,t) 



(6) 

This denotes the probability that a given subset of 
nodes {ri, . . . , r„} forms an aligned subgraph (G, A), 
with the definition ^ of the pairwise mismatch of the 
subgraph G and the template t in a given alignment 
A. Again Zt is given by normalization. This is a 
hidden Markov model: the outcome of the stochastic 
process is an aligned subgraph (G, A) while only G 
is observed. The likelihood of observing G is then a 
sum over all alignments, 



Qt(G) = ^Qt(G,^). 



(7) 



This ensemble has two free parameters, the link re- 
ward (7 and the mismatch penalty fi (with a factor 1 /2 
introduced for later convenience). It is conceptionally 
similar to hidden Markov models for the alignment of 
sequences with gaps. 



Statistics of correlated subgraphs 

Now we turn to the case where a network motif is 
not given as a template but has to be inferred from a 
family of suitably aligned subgraphs. The underlying 
evolutionary process can be regarded as a biased link 
formation as in the previous section, with the con- 
sensus pattern c as "template" . Assuming the link 
formation is independent for each subgraph, we ob- 



tain an ensemble given by 
p 

g,,^(G\...,Gf,^)/I]Po(G") 



(8) 



exp 



exp 



a5]i(c")-f ^Af(c",c) 



a=l 
P 



ct=l ^ a, 13=1 



where A specifies an alignment of all subgraphs and 
we have used the definition |(5J) of the consensus pat- 
tern. The normalization is given by 



E E ^'^p 

A ci,...,c!' 



(9) 



2p 



a,/3=l 



n poiG'^)- 



The scoring function 

We now construct a scoring function designed to se- 
lect a set of (putatively) functional subgraphs - char- 
acterized by a consensus motif with a high number 
of internal links and low fuzziness - from the back- 
ground of random subgraphs in a large network. 

Based on the preceding discussion, we assume that 
the statistics of functional motifs is described by an 
ensemble Q{G\ . . . ,Gp,A) = Q<,,^(G\ . . . , G^, ^), 
where the scoring parameters a and fi remain to be 
determined from the data. 

For the biological applications described above, 
where internal links are associated with feedback 
loops, it is clearly useful to restrict the motif search 
to the set of all connected subgraphs which contain 
internal loops, i.e., which are non-treelike. For con- 
nected subgraphs of size n, this set is given by the 
constraint L > n on the internal link number. A 
large random graph typically contains a number of 
order one of such subgraphs, and these define the rel- 
evant null ensemble for motif search. We model these 
subgraphs using the ensemble P^^ with an enhanced 
number of links defined in (jSJ. The parameter ctq 
will be adjusted such that the average number of in- 
ternal links in the null ensemble equals that found in 
the non-treelike subgraphs of a suitable randomized 
graph. Comparing with the ensemble Pq of random 
subgraphs introduced earlier, it is clear that the con- 
straint L > n corresponds to a link reward (Tq > 0- 
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Given these two ensembles, we define the log- otherwise. The resulting Hamiltonian Ti is 
likelihood score 



S{G\...,GP,A) 
= log 



P,„(Gi,...,GP,^) 

a=l ^ a, 13=1 



(11) 



a=l 



-log(Z„^^/Z^J , 



(10) 



which is positive if a set of subgraphs , . . . , and 
an alignment A between them is more likely to occur 
in the ensemble Qa,p, than in the null ensemble P^o- 
The term log{Z^^^/Za-o) acts as a threshold assigning 
a negative score to alignments with too large fuzziness 
or a too small number of internal links. 

As is clear from the form of the scoring function, 
graph alignment is a nontrivial optimization problem, 
the statistical weight of each subgraph G" depending 
on the scoring parameters as well as on the other 
subgraphs included in the alignment. We address this 
problem in two steps. First we find the maximum- 
score alignment(s) for given score parameters, which 
is essentially an algorithmic search problem. Then 
we discuss the parameter dependence of high-scoring 
alignments and obtain the optimal values of a and 
/i for a given data set from a maximum-likelihood 
procedure. 



MEiximum score alignments and parametric 
optimization 

Finding the maximum score alignments involves a 
huge search space of possible alignments. The num- 
ber of alignments is of order (np)^ for given p and 
the computational expense grows further when the 
optimization over p is performed. Here we use a 
heuristic algorithm, which can be described by a 
mapping to a discrete spin model. First we enu- 
merate all non-treelike subgraphs of n nodes, which 
is feasible for modest values of n, and label them 
by the index a = l,...,p„iax- Next we evaluate 
the internal link numbers i" = -^(c") and the pair- 
wise mismatches M"^, defined as the minimum of 
M(c",c'') over all pairwise alignments of the sub- 
graphs G" and . High-scoring multiple alignments 
are then found by a simulated annealing algorithm in 
the space (s^, . . . , s^™^''), where each "spin" takes 
the value 1 if G" is included in the alignment and 



a, 0=1 



where p = s" . The coupling between and 
is given by M°'^ , which is equal to the pairwise mis- 
match M"'^ if subgraphs a and /3 do not overlap, and 
a large positive constant if they do. (Two subgraphs 
overlap if they have more than one node in common. 
According to this definition, links in non-overlapping 
subgraphs form independently as assumed in ©•) 
The threshold term log{Z^^^/Za-o) is evaluated by 
saddle-point integration, details are given in the sup- 
porting text. Simulated annealing using the Hamil- 
tonian Hill) will then yield high-scoring alignments of 
non-overlapping subgraphs ,22, . 

For fixed values of the scoring parameters, the algo- 
rithm is expected to produce well-defined maximum- 
score alignments. This can be understood as fol- 
lows. For a (hypothetical) alignment of subgraphs 
with equal number of internal links and equal pair- 
wise mismatches, the score (|10|l scales linearly with 
p, the number of aligned subgraphs. This is consis- 
tent with the interpretation of H1U|) as a log-likelihood 
score, since the aligned subgraphs occur indepen- 
dently. A high-scoring alignment in a realistic net- 
work may consist of a limited number of identical or 
very similar motifs. As we extend this alignment to 
include more subgraphs, subgraphs with increasing 
mutual mismatches are included. Hence, we expect 
the total mismatch to increase faster than linearly 
with p, leading to a maximum S* (cr, fj,) of the total 
score at some intermediate value of p* (ct, fi) . 

The properties of the maximum-score alignments 
depend strongly on the parameters a and fx. With in- 
creasing (T, the number of internal links L*{a,fi) per 
subgraph is expected to increase. With increasing /i, 
both the number of graphs p* (tr, /x) and the fuzziness 
M (a, /i) decrease. In this way, the maximum-score 
alignment varies between a set of independent sub- 
graphs for fi — and a set of identical subgraphs 
with identical motifs for fi oo. 

A maximum-likelihood approach can be used to 
infer the optimal scoring parameters a* , n* for a given 
data set, which we obtain as the point of the global 
score maximum S* = maxo-,^ S* (cr, /i) . 
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Results and Discussion 

In this section, wc discuss the appHcation of 
local graph alignment to motif search in the 
gene regulatory network of E. coli, taken from 
|http : / / www. weizmann . ac . il/mcb /Uri Alon / 
Network_motifs_in_coli/ColiNet-l.l/, containing 424 
nodes and 577 directed links. Each labeled node 
in this network represents a gene. A directed link 
between two nodes signifies that the product of the 
gene represented by the first node acts as a transcrip- 
tion factor on the gene represented by the second. 
Throughout we consider motifs with a fixed number 
of nodes n — 5. 

First we show that our algorithm indeed produces 
well-defined alignments of maximal score (i.e., of 
maximal relative likelihood). For fixed parameters 
cr = 3.8 and /i = 4.0, this is illustrated by Fig.|21(a), 
which shows the score S and the fuzziness M for the 
highest-scoring alignment with a prescribed number 
p of subgraphs, plotted against p. As expected, the 
fuzziness increases with increasing p, and the total 
score reaches its global maximum S* (cr, /i) at an in- 
termediate value p* (cr, /i) . It is lower for p < p* (cr, /i) 
since the alignment contains less subgraphs and for 
p > p* (cr, ^) since the subgraphs have higher mutual 
mismatches. Fig.|21(b) shows the score 5*(cr, /i) as a 
function of ct and /i. This function has a unique global 
maximum S* , which defines the maximum- likelihood 
point (cr* = 3.8, fi* = 2.25, p* = 24). 

The scoring parameter erg of the null ensemble Pa-g 
is determined as follows: the data set is randomized 
by generating a network with the same connectivities 
as in the data set but randomly chosen links irU] . 
Again, the non-treelike subgraphs are extracted and 
their average number of internal links is determined. 
The value of ctq is uniquely determined by the con- 
dition that the expected number of links in the null 
ensemble (jSJ equals the average number of internal 
links found in the non-treelike subgraphs. One ob- 
tains crp = 2.45. As expected, ao < a* , which shows 
that the data set has an enhanced number of internal 
links relative to the randomized network. 

At the maximum-likelihood scoring parameters, we 
can moreover verify the functional form of the en- 
sembles used to construct the score function ((Tn|) . In 
order to test the model (0) for individual subgraphs, 
we enumerate all subgraphs with n = 5 that have 
non-treelike patterns (i.e., a link number L > 5). All 
ordered pairs of nodes i,j are then binned according 
to the the probability Wij of a directed link existing 
between them in the ensemble Po{G). In Fig. |31 (a) 
the fraction of these pairs i , j carrying a link is plotted 




(b) 

Figure 2: Maximum score alignment and para- 
metric optimization, (a) Score optimization at fixed 
scoring parameters a — 3.8 and fi = 4.0. The total score 
S (thick line) and the fuzziness M (thin line) are shown 
for the highest-scoring alignment of p subgraphs, plotted 
as a function of p. (b) The score S* (cr, /j.) plotted against 
the parameters ^ and cr. The unique maximum S* de- 
fines the maximum-likelihood parameters a* — 3.8 and 
^i* = 2.25. 

against w (square symbols). The expectation value 
of this fraction is given by (jsj as e"'w/{l ~ w + e"'w), 
shown as a solid line with a fit value cr* = 3.8. 

Our model © for generic alignments can be tested 
in a similar way. From this ensemble, the marginal 
probability that a given ordered pair of nodes speci- 
fied by a, z, j is linked can be computed. We group all 
such pairs with the same expectation value (c^)^',^!' 
according to (jSJ to build a histogram. For each 
group, the average of cfj over node pairs in the actual 
maximum-likelihood alignment is computed and plot- 
ted against the model prediction, see fig. ISIb) . The 
same procedure is repeated for the two-point correla- 
tions {cfjC^j) a' ,11* between associated nodes in differ- 
ent subgraphs a and /? as also shown in fig. Otb) . In 
both histograms, the data points cluster well around 
the straight line equating expectation values in the 
model and averages in the actual alignment. The 
fluctuations seen reflect the limited size of the data 
set and the small number of fltting parameters in 
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Figure 3: Statistics of motif ensembles, (a) Test- 
ing the statistical model for single subgraphs Non- 
treelike subgraph are enumerated and node pairs i,j 
binned according to Wij. The fraction of such pairs car- 
rying a link is shown against Wij. The solid line results 
from fitting the model with enhanced number of links ||KJ 
to this data, giving a = 3.8. (b) Testing the statistical 
model for alignments (|HJ. Top: The average value of cfj 
over all a,i,j with a given expectation value of ac- 
cording to JSJ at a = o"* = 3.8 and fi — fi* = 2.25 against 
the corresponding expectation value (squares). For a per- 
fect fit between model and data a straight line is expected 
(shown solid) . Bottom: The same procedure is used aver- 
aging the two-point function cfjcf^ over all a,j3,i,j with 
a given expectation value {cfjcf^}. 



the model. For such data, more detailed models can 
hardly be tested since they would lead to overfitting. 

We now turn to the probabiUstic consensus motifs 
found in the data for different number of nodes n — 4 
and n = 5. Fig.^fa) shows the n = 4 consensus motif 
Cij at consecutive values of = /z* = 3.6, fi = 8, and 
fi = 15. The gray-scale encodes the average num- 
ber of links Cij between a given pair of nodes. As 
expected, the fuzziness decreases with increasing val- 
ues of the mismatch penalty fj, and Cy tends either 



to zero (no link present) or one (link present with 
certainty) as /i — > cxd. The consensus motif is a lay- 
ered structure, in this case with 2 input and 2 output 
nodes. 

A similar motif is found for n = 5. Fig.Ql^b) shows 
the n — 5 consensus motif at consecutive values of 
fi — 2.25, 5, and 12. As in the case of n = 4, a lay- 
ered structure is clearly discernible: the motif con- 
sists of 2 -|- 3 nodes forming an input and an output 
layer, with links largely going from the input to the 
output layer. The left node of the input layer has an 
average number of about 30 outgoing links. These 
connectivities are exceptional since the average out- 
connectivity of the network is 1.36. 

Comparing the alignments of subgraphs of n = 4 
nodes with those of n = 5 nodes in figure ^ one 
finds that many of the subgraphs found in the n = A- 
alignments also are a part of the subgraphs found in 
the n = 5-alignments. This immediately leads to the 
question of how to identify larger patterns in the net- 
work from which the subgraphs at a given value of n 
are taken. Obviously any scoring scheme operating 
at a fixed number of nodes n will be blind to the com- 
binatorial possibilities of selecting subgraphs from a 
larger pattern. The phenomenon is exemplified in fig- 
ure ^c). From the 3-by-4 pattern 2 non-overlapping 
layered subgraphs with n = 4 and n = 5 can be 
generated (non-overlapping subgraphs have at most 
one node in common, see above). Larger patterns 
generate correspondingly more non-overlapping sub- 
graphs. In the supporting text, we discuss a simple 
scheme which allows to identify larger patterns as in 
figure 0fc) from smaller subgraphs. The pattern of 
figure ^c) is found twice in the data, contributing in 
total 4 non-overlapping subgraphs to the alignments 
with n — 4 and n = 5. The statistics of these pat- 
terns at the level of identical patterns has recently 
been analyzed in their treatment using proba- 
bilistic patterns remains a future development. 

Figure ^d) shows details of the alignments pro- 
ducing the consensus motifs at n = 5, namely, the 
number of subgraphs p* {a = a* , fi) and the fuzzi- 
ness M [cr = a* , ji) plotted as a function of fi. For 
pL > 12 the fuzziness reaches zero and the alignment 
contains 10 identical, non-overlapping motifs. This 
layered pattern has been found by the approach of 
ref. which is based on counting identical motifs. 
However, the maximum-likelihood alignment occurs 
at /i* = 2.25 and contains a much larger number of 
p* — 24 non-overlapping subgraphs, leading to the 
probabilistic consensus motif shown on the left in 
fig.^Ja). The same effect is found in the consensus- 
motif of size n — A. Furthermore, at arbitrary non- 
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(d) 

Figure 4: Probabilistic motifs in the E. coli tran- 
scription network, (a) Consensus motifs with n — 4 
nodes at different values of ^. From left to right, ^ — 
jj,* = 3.6, fj, = 8, and /i = 15. The gray-scale of the 
links indicates the likelihood that a given link is present 
in the aligned subgraphs; the 5 gray values correspond to 
c in the range 0.1 - 0.2, 0.2 - 0.4, 0.4 - 0.6, 0.6 - 0.8, 
0.8 — 0.9, links with c > 0.9 are shown black. The link 
reward is kept fixed at cr = CT* = 3.6 and ctq takes on 
the value 3.15. (b) Consensus motifs with n — 5 for dif- 
ferent jj. = = 2.25, /J, = 5, and jj, = 12 (left to right) 
a.t a — a* — 3.8. (c) This pattern with n = 7 is found 
twice in the data-set. From each such subgraph 2 non- 
overlapping layered subgraphs with n = 4 and n — 5 can 
be generated, (d) The number p*{a* of subgraphs in 
the maximum score alignment (thick line) and the fuzzi- 
ness A/ {a*,jj,) (thin line) as a function of /i for n — 5. 



zero fuzziness the probability that a given pair of sub- 
graphs have identical motifs decreases with subgraph 
size. As a result, counting identical motifs, rather 
than following a probabilistic approach as the one 
presented here, will miss an fraction of relevant sub- 
graphs present in the data which increases with the 



size of the subgraph. 

The probabilistic grounding of motif search is 
also indispensable for quantitative significance esti- 
mates of the results obtained. Here we compare the 
maximum-likelihood alignment in the E. coli data set 
with suitable random graph ensembles. We do this 
in two steps, in order to disentangle the significance 
of the number of internal links, and of the mutual 
similarity of patterns found in the data. 

(i) To assess the significance of the number of in- 
ternal links, we consider the ensemble of graphs with 
the same in- and out-connectivities as the data set 
but randomly chosen neighbors ^1 and compute 
the distribution of the score with scoring parame- 
ters (7 = (T*, /i = 0. The null distribution of scores 
from the randomized graph has the average and stan- 
dard deviation given by S* — 5.7 ± 2.1. The score 
S* = 73.1 found from the data is thus significantly 
higher, indicating an enhanced link number with re- 
spect to the random graph ensemble, (ii) The signifi- 
cance of the mutual similarity of the aligned patterns 
is assessed by comparing the data to mutually in- 
dependent random subgraphs with the same average 
density of links. (This null ensemble is generated by 
randomizing the internal links of each subgraph inde- 
pendently.) We then compute the score with param- 
eters a = ao = cr* and /i = /x* (thereby only focusing 
on the fuzziness of the data relative to that found in 
the ensemble of uncorrelated subgraphs). This null 
distribution of scores has average and standard devia- 
tion given by S* = 27.1 ±6.3; the corresponding score 
S* = 50.1 found from the data is thus significantly 
higher. We note that the assessment of subgraph sim- 
ilarity is quite subtle. Subgraphs taken from a large 
but finite random graph may show a 'spurious' mu- 
tual similarity with respect to independent random 
subgraphs due to a prevalence of internal loops. 

The statistical significance of the results can be 
formulated more precisely using so-called p-values, 
which involve the tail of the score distribution in the 
random graph ensemble. Fast and reliable p- value es- 
timates are crucial for the search in large databases, 
as it is well known for sequence alignment |21) . This 
approach can be carried over to the graph alignments 
discussed here. 

The statistical framework presented is very flexi- 
ble. For example, as large-scale data on the logic 
of gene regulation become available, the definition 
of the pairwise mismatch (jSJ can be extended to 
reward aligning sets of nodes performing the same 
logical function. In this way, features of motifs go- 
ing beyond their topology can be explored. Simi- 
larly, simple modifications of the mismatch score al- 
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low the analysis of undirected networks, networks 
whose links have a specific function (repressive or en- 
hancing) or whose interaction strength is quantified 
by a real number. The statistical framework pre- 
sented is very flexible. For example, as large-scale 
data on the logic of gene regulation become avail- 
able, the definition of the pairwise mismatch Q can 
be extended to reward aligning sets of nodes perform- 
ing the same logical function. In this way, features of 
motifs going beyond their topology can be explored. 
Similarly, simple modifications of the mismatch score 
allow the analysis of undirected networks, networks 
whose links have a specific function (repressive or en- 
hancing) or whose interaction strength is quantified 
by a real number. 

The prospect of a sizable amount of new data on 
biological networks becoming available over the next 
few years through high-throughput methods opens 
exciting opportunities to identify the building blocks 
of molecular information processing in a wide range 
of organisms, and even build phylogenetic histories of 
regulation from transcription network data. 
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Supporting Text 

In the supplementary material we first give additional 
details on the alignment algorithm and then discuss a 
simple procedure for identifying larger modules gen- 
erating multiple smaller subgraphs. 
The algorithm proceeds in four stages: 

1. By enumeration all unique non-treelike sub- 
graphs of size n are found. We consider only sub- 
graphs where each node carries at least two inter- 
nal links, other than self-links ("exclusion of dan- 
gling links" ) . The reason is that including dan- 
gling links would generate from each subgraph 
an artificially inflated family of subgraphs gen- 
erated by including all combinations of neighbor- 
ing nodes into the subgraph. The enumeration is 
done by first finding all closed paths in the graph 
of length shorter than 2n — 3. (The maximum 
length derives from considering the non-treelike 
structure with the longest pathlength from the 
origin through all points of the subgraph back 
to the origin. The graph is considered as undi- 
rected at this stage.) The subgraphs are labeled 
by a = 1, . . . 

2. The pairwise minimal mismatch M"^ for all 
pairs of subgraphs a,/3 is found by enumerating 
all n\ possible alignments of each pair of sub- 
graphs. For each pair of subgraphs a and /3 we 
determine whether they overlap by counting the 
number of nodes they have in common. The ele- 
ments of the coupling matrix M"^ in the Hamil- 
tonian lTTI are given by M'^^ if the subgraphs do 
not overlap, and by a large number, chosen to 
be 10, if they do. 

3. The next task is to select a subset of the sub- 
graphs such that the total score lTol is maximized 
at given values of the scoring parameters. To this 
end simulated annealing is used, with the (nega- 
tive) score as the energy function, increasing the 
inverse temperature from to 10 in 1000 Monte- 
Carlo sweeps. We assign each subgraph a spin 
variable: spin s" = 1 implies that the subgraph 
a is included in the alignment, spin s"' — that 
it is not. The contribution to Eq. 1101 from the 
mismatch of two subgraphs acts as a coupling be- 
tween their spins, the contribution of subgraph 
a to the total number of links L in 1101 acts as a 
local field, resulting in the Hamiltonian llll The 
evaluation of the last term in 1101 log(Z/Zo), is 
discussed below. 

The last step is repeated at different values of a and 



H in order to perform the parametric optimization 
leading to Fig. [Jt- The parameter ctq describing the 
null ensemble, on the other hand, is determined inde- 
pendently by considering the non-treelike subgraphs 
found in the randomized graph as described in the 
paper, cto is chosen such that the average number 
of internal links in the ensemble of uncorrelated sub- 
graphs with an enhanced number of links [31 equals 
that of the non-treelike subgraphs found in the ran- 
domized network, 

I P _ 

~ / ,(-^(^ ))crn,u— ^ -^randomized ■ 
a — 1 

Note that the ensemble |S1 still depends on the con- 
nectivities of the nodes in each subgraph. The gen- 
eralization to several groups of subgraphs, where only 
subgraphs from the same group are aligned with each 
other, can be done by admitting more states of the 
Potts-like spins s" . For g-state spins this would group 
the subgraphs into q — 1 clusters much like in super- 
paramagnetic clustering (1). 

There are two approximations behind this algo- 
rithm. First, treelike subgraphs are excluded from 
the start. This step cuts down an enormous number 
of combinatorial possibilities associated with treelike 
subgraphs, which, different connectivities apart, are 
always locally similar. 

Second, it uses the minimal mismatch obtained 
from the pairwise alignment of subgraphs (step 2), 
even though the minimal mismatch obtained by 
aligning a set of more than two subgraphs may be 
higher than that of the sum of pairwise alignments. 
This is easily seen by comparing the total number 
of alignments of all pairs chosen from p subgraphs, 
(^j^!)p(p-i)/2^ with the number of alignments of p sub- 
graphs, {n\)P. However, in the case of interest, where 
the graph contains multiple copies of a motif (possi- 
bly corrupted by noise), the sum of pairwise minimal 
mismatches will typically be very close to the mini- 
mal mismatch obtained from aligning all subgraphs 
simultaneously. 

The maximal-score alignment A* (a, fi) turns out 
to be unique in most subgraphs. To see this, con- 
sider all alignments .4" of a particular subgraph a 
with respect to the other subgraphs whose alignment 
is kept fixed. Two different alignments have the same 
score if and only if there is a permutation of the 
nodes rf , . . . ,r" leaving both the adjacency matrix 
cfj and the matrix wfj defined above Eg . l4l invariant . 
While symmetries of the adjacency matrices occur 
frequently, entries of the matrix Wij are unique in 
most subgraphs, since the connectivities in biological 



networks are broadly distributed. 

We now discuss the normalizing constant [51 of the 
alignment ensemble |8l We approximate the likeli- 
hood of given parameter values, which involves the 
sum over all alignments ,4 as in [S] by the corre- 
sponding maximum-score alignment. (In the liter- 
ature for sequence alignment, this is known as the 
Viterbi approximation.) An improved likelihood es- 
timate is possible using probabilistic graph align- 
ment algorithms but is not expected to alter our re- 
sults qualitatively. The optimal alignment has p* = 
p*{a* , fi*) subgraphs with average internal link num- 
ber L = L (cr*, /u*) and fuzziness M =M {a*,fi*). 
As may be seen by differentiation of 1101 with respect 
to the scoring parameters, at u = cr* and fi = ii* the 
Q ensemble fits to the data set in the sense that the 
expectation values of the internal link number and 
the fuzziness equal the actual values. 
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a, 13=1 

The normalizing constant |9l needs to be computed 
for two sets of parameters; for tr, /z characterizing 
the Q ensemble, and for ao, /io characterizing the 
Qq ensemble. In both cases the normalizing constant 
consists of a trace over the link configuration {c"} 
in all subgraphs. Since the constant |2l factorizes in 
the link labels we consider only a single of these 
factors (a single "string"), drop the i,j indices, and 
separate the bilinear form of the pairwise mismatch 
|3]into a quadratic and a linear part. 

Formally, this expression is the partition function 
of a mean-field ferromagnet in a fluctuating field. The 
field depends on the local connectivities of each node 
along the "string" via the ensemble Pq, Eq. 01 Using 
a Hubbard-Stratonovich transformation to linearize 
the quadratic term, the trace over {c"} can be per- 
formed, giving 



Z 



where 



g°'{t)= log (l-w")-f w"exp|y2^t + cr-^| 

For large p this expression can be evaluated by a sad- 
dle point integral, giving 



where t* maximizes the exponent in Eq.l2. The con- 
tribution to leading order of adding a new subgraph 
with index a is thus 

AlogZ w -t*V2 + 5"(t*) • 

The change of t* as a finite number of subgraphs is 
added to or removed from the alignment alters the 
result □ only by terms of order p^^. It thus turns out 
to be sufficient to update the saddle-point value t* 
for each link i,j once per Monte-Carlo sweep of the 
algorithm. 

In order to compute „ for each pair i,j in the 
Viterbi approximation, the one-to-one mapping be- 
tween nodes in each subgraph A is needed, going be- 
yond the pairwise alignment. This mapping is also 
needed to produce the plots of the consensus motifs 
in Fig. ^ It is produced by minimizing the fuzziness 
over the mapping between nodes in each subgraph, 
again using Monte-Carlo dynamics (100 Monte-Carlo 
sweeps while linearly increasing the inverse tempera- 
ture from to 10). The result of course depends on 
the subgraphs in the alignment, and thus the map- 
ping ought to be updated each time a subgraph is 
added or removed from the alignment. In practice, 
however, one update of the mapping between nodes 
in each subgraph every 250 steps of the algorithm is 
sufficient. The reason for this is again that the map- 
ping between nodes in subgraphs in the alignment is 
unchanged as motifs sufficiently close to the consen- 
sus motif enter/leave the alignment. 

Finally, we discuss a simple procedure for identi- 
fying these structures from the set of subgraphs at 
fixed (small) n. First, for a given subgraph of size n, 
all neighbors with at least two links to the subgraph 
are enumerated. In this way, non-treelike subgraphs 
without dangling bonds with n + 1 nodes are gener- 
ated. This procedure is repeated for the entire list 
of p subgraphs. Several subgraphs of size n + I will 
occur repeatedly in this list: the more subgraphs of 
size n can be be generated from the larger n + I sub- 
graph the more frequently it occurs on the list. Thus 
ranking the n -\- 1 subgraphs according to the num- 
ber of times they occur in the list, one obtains the 
n + 1 subgraph from which the largest number of n 
subgraphs derive. Clearly this procedure can be re- 
peated iteratively, leading to subgraphs of increasing 
size. In fact. Fig. 01 c is the result of applying this 
scheme to the non-treelike subgraphs with n = 5. It- 
erating twice, one finds two instances of the n = 7 
layered structure of Fig. 
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